Note: This report showcases the analyses performed for the cafe, but prepared with anonymized and modified data for demonstration purposes.

Table of contents

  1. Exploratory data analysis
    1.1 Revenue decomposition by day of the week
    1.2 Revenue decomposition by type of day and season
    1.3 Trends
    1.4 Hot Chocolate Festival (HCF) impact on sales

  2. Regression analysis - HCF
    2.1 HCF impact - Item A
    2.2 HCF impact - Item B

  3. Regression analysis - individual item sales
    3.1 Understanding Item A sales
    3.2 Understanding Item B sales
    3.3 Understanding Item C sales

    Appendix 1 - Statistical tests on recommendations from section 1.2
    Appendix 2 - Regression models - multicollinearity checks
    Appendix 3 - Regression models - residual plots

1. Exploratory data analysis

1.1 Revenue decomposition by day of the week

display(HTML(sales_dist_div))
display(HTML(order_dist_div))
display(HTML(spo_dist_div))

Analysis

  • Weekends show the highest overall demand, but also exhibit the greatest fluctuations in sales (greater spread in boxplots).

  • Differences in sales between weekends and weekdays appear to be driven by variations in customer traffic, as the distribution of sales per order remains consistent across the week.

  • Weekdays demand is relatively stable, with overlapping sales distributions across Mondays to Thursdays.

  • Sales on holidays are comparable to Fridays, suggesting strong demand.

  • Wednesday holidays (e.g., Christmas Day and New Year’s Day) saw higher sales per order but fewer total orders, possibly reflecting larger group purchases or celebrations.


1.2 Revenue decomposition by type of day and season

Back to top

rev_decomp_fig.show()

Analysis

  • Summer revenue was driven by a high number of orders, however sales per order remained comparable to other the seasons, suggesting strong customer traffic rather than higher spending per visit.

  • Fall and Winter saw similar customer traffic, but Winter had higher average sales per order, possibly due to holiday-related purchases.

  • In Spring, the number of orders increased notably compared to Winter, though sales per order remained relatively stable, suggesting customer traffic starts to pick up in Spring, but customers are not necessarily spending more per visit.

  • Across all seasons except Summer, holidays experienced higher average sales per order when compared to weekends and Fridays.

  • Weekends consistently recorded higher sales per order than Fridays, regardless of season. This emphasizes the importance of the weekend traffic.

Recommendations

  • Spring and Summer show a pick up in customer traffic, consider launching new items during these seasons to capitalize on the higher traffic.

  • As the propensity to spend appear to be higher on holidays, it offers an opportunity to offer limited-time combo’s or promotions to further boost revenue.

Please refer to Appendix 1 for the statistical tests performed in support of the recommendations.


Recommendations

  • To improve revenue during the slow period, need to find ways to boost item A sales. E.g. introducing seasonal flavours, or combo with hot drinks etc.


1.4 Hot Chocolate Festival (HCF) impact on sales

Back to top

As HCF happened during the Winter, we’ll compare sales during the HCF days and non-HCF days in the Winter only for a more accurate comparison.

HCF_A_fig.show()
HCF_B_fig.show()

Analysis

  • HCF did not appear to have a strong effect on item A sales. Most days exhibited overlapping distributions, with Saturdays being the only exception.

  • HCF seems to have suppresssed item B sales, with three of the days exhibiting notably lower sales.

To supplement the graphical analysis above, we’ll perform a regression analysis.


2. Regression analysis - Hot Chocolate Festival (HCF)

2.1 HCF impact - Item A

Back to top

Here we attempt to explain daily item A sales during the Winter using the following factors:

  • rain: Amount of rainfall (mm) in Metro Vancouver. This models the effect of rainfall on item A sales.

  • snow: Amount of snowfall (mm) in Metro Vancouver. This models the effect of snowfall on item A sales.

  • is_long_weekend: Represents whether the day is part of a long weekend. This captures the effect of long weekends on item A sales.

  • is_holiday: Represents whether the day is a holiday. This captures the effect of holidays on item A sales.

  • day_of_week: Represents the day of the week (Monday, Tuesday etc.). This captures the weekly seasonality of the business.

  • is_HCF: Represents whether the day is a Hot Chocolate Festival day. This captures the effect of HCF on item A sales.

The variable of interest here is is HCF. If participating in the HCF does have a spillover effect on item A sales, is_HCF will have a positive coefficient that is statistically significant.

formula = (
    """
        item_A_sales ~ 
        + rain 
        + snow 
        + is_long_weekend
        + C(is_holiday)
        + C(day_of_week) 
        + C(is_HCF)
    """
    )

hcf_A_model = smf.ols(formula, data=hcf_A_df).fit()

print(hcf_A_model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:           item_A_sales   R-squared:                       0.796
Model:                            OLS   Adj. R-squared:                  0.767
Method:                 Least Squares   F-statistic:                     27.69
Date:                Fri, 15 Aug 2025   Prob (F-statistic):           1.60e-22
Time:                        13:57:59   Log-Likelihood:                -601.40
No. Observations:                  90   AIC:                             1227.
Df Residuals:                      78   BIC:                             1257.
Df Model:                          11                                         
Covariance Type:            nonrobust                                         
===============================================================================================
                                  coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------
Intercept                    1715.8404     61.974     27.687      0.000    1592.460    1839.221
is_long_weekend[T.True]       111.4486    132.132      0.843      0.402    -151.606     374.503
C(is_holiday)[T.True]         815.1812    121.936      6.685      0.000     572.426    1057.937
C(day_of_week)[T.Tuesday]     -80.4367     82.758     -0.972      0.334    -245.195      84.322
C(day_of_week)[T.Wednesday]    74.6287     83.686      0.892      0.375     -91.978     241.235
C(day_of_week)[T.Thursday]     22.3063     83.004      0.269      0.789    -142.941     187.554
C(day_of_week)[T.Friday]      449.5122     84.417      5.325      0.000     281.451     617.573
C(day_of_week)[T.Saturday]    809.8290     85.409      9.482      0.000     639.792     979.866
C(day_of_week)[T.Sunday]      841.2587     82.239     10.229      0.000     677.533    1004.984
C(is_HCF)[T.True]              73.7003     52.359      1.408      0.163     -30.539     177.939
rain                           -2.7682      4.069     -0.680      0.498     -10.868       5.332
snow                          -34.8685     23.917     -1.458      0.149     -82.484      12.747
==============================================================================
Omnibus:                       19.947   Durbin-Watson:                   2.254
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               27.493
Skew:                           1.016   Prob(JB):                     1.07e-06
Kurtosis:                       4.790   Cond. No.                         53.9
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Analysis

  • The coefficient for is_HCF is not statistically different from 0, indicating there is no strong evidence for a spillover effect on item A from participating in the HCF, after controlling for other variables. The positive coefficient could be interpreted as weak evidence that HCF did drive some item A sales, but not by a significant amount.

  • Item A sales during the Winter was still primarily explained by the weekly seasonality and holiday effects.


2.2 HCF impact - Item B

Back to top

We will be following the same modelling approach from above for HCF’s impact on item B.

formula = (
    """
        item_B_sales ~ 
        + rain 
        + snow 
        + is_long_weekend
        + C(is_holiday)
        + C(day_of_week) 
        + C(is_HCF)
    """
    )

hcf_B_model = smf.ols(formula, data=hcf_b_df).fit()

print(hcf_B_model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:           item_B_sales   R-squared:                       0.748
Model:                            OLS   Adj. R-squared:                  0.712
Method:                 Least Squares   F-statistic:                     21.05
Date:                Fri, 15 Aug 2025   Prob (F-statistic):           4.76e-19
Time:                        13:57:59   Log-Likelihood:                -604.63
No. Observations:                  90   AIC:                             1233.
Df Residuals:                      78   BIC:                             1263.
Df Model:                          11                                         
Covariance Type:            nonrobust                                         
===============================================================================================
                                  coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------
Intercept                    1663.6781     64.236     25.900      0.000    1535.795    1791.562
is_long_weekend[T.True]       424.4739    136.954      3.099      0.003     151.819     697.129
C(is_holiday)[T.True]         413.8744    126.386      3.275      0.002     162.259     665.490
C(day_of_week)[T.Tuesday]    -142.9222     85.779     -1.666      0.100    -313.694      27.850
C(day_of_week)[T.Wednesday]  -131.4728     86.741     -1.516      0.134    -304.160      41.215
C(day_of_week)[T.Thursday]    -50.0111     86.033     -0.581      0.563    -221.290     121.268
C(day_of_week)[T.Friday]      422.4017     87.498      4.828      0.000     248.207     596.597
C(day_of_week)[T.Saturday]    676.2040     88.527      7.638      0.000     499.961     852.447
C(day_of_week)[T.Sunday]      528.3789     85.241      6.199      0.000     358.678     698.080
C(is_HCF)[T.True]             -24.7872     54.270     -0.457      0.649    -132.831      83.256
rain                           -0.6680      4.217     -0.158      0.875      -9.064       7.728
snow                          -60.8765     24.790     -2.456      0.016    -110.230     -11.523
==============================================================================
Omnibus:                        0.065   Durbin-Watson:                   1.576
Prob(Omnibus):                  0.968   Jarque-Bera (JB):                0.235
Skew:                          -0.009   Prob(JB):                        0.889
Kurtosis:                       2.750   Cond. No.                         53.9
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Analysis

  • Similar to item A, the coefficient for is_HCF is not statistically different from 0, indicating there is no strong evidence that item B sales during HCF differed from sales during non-HCF periods, after controlling for other variables.

  • Item B sales during the Winter were also primarily explained by the weekly seasonality and holiday effects, but compared to item A, long weekends and weather conditions also contributed.


3. Regression analysis - individual item sales

3.1 Understanding Item A sales

Back to top

Here we attempt to explain daily item A sales on a broader scale with the following factors:

  • rain: Amount of rainfall (mm) in Metro Vancouver. This models the effect of rainfall on item A sales.

  • snow: Amount of snowfall (mm) in Metro Vancouver. This models the effect of snowfall on item A sales.

  • is_long_weekend: Represents whether the day is part of a long weekend. This captures the effect of long weekends on item A sales.

  • is_holiday: Represents whether the day is a holiday. This captures the effect of holidays on item A sales.

  • season: Represents the season (Summer, Fall etc.). This captures the seasonal effects of the business.

  • day_of_week: Represents the day of the week (Monday, Tuesday etc.). This captures the weekly seasonality of the business.

formula = (
    """
        log_A ~ 
        + rain 
        + snow 
        + is_long_weekend
        + C(is_holiday)
        + C(season)
        + C(day_of_week) 
    """
    )

A_model = smf.ols(formula, data=A_sales_df).fit()

print(A_model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  log_A   R-squared:                       0.766
Model:                            OLS   Adj. R-squared:                  0.756
Method:                 Least Squares   F-statistic:                     81.75
Date:                Fri, 15 Aug 2025   Prob (F-statistic):           4.71e-94
Time:                        13:57:59   Log-Likelihood:                 298.17
No. Observations:                 339   AIC:                            -568.3
Df Residuals:                     325   BIC:                            -514.8
Df Model:                          13                                         
Covariance Type:            nonrobust                                         
===============================================================================================
                                  coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------
Intercept                       7.4547      0.019    397.498      0.000       7.418       7.492
is_long_weekend[T.True]         0.0040      0.026      0.154      0.878      -0.048       0.056
C(is_holiday)[T.True]           0.3098      0.035      8.771      0.000       0.240       0.379
C(season)[T.Spring]             0.0927      0.017      5.555      0.000       0.060       0.125
C(season)[T.Summer]             0.1597      0.015     10.317      0.000       0.129       0.190
C(season)[T.Fall]              -0.0206      0.016     -1.323      0.187      -0.051       0.010
C(day_of_week)[T.Tuesday]       0.0098      0.022      0.452      0.651      -0.033       0.053
C(day_of_week)[T.Wednesday]     0.0330      0.022      1.525      0.128      -0.010       0.075
C(day_of_week)[T.Thursday]      0.0152      0.022      0.705      0.481      -0.027       0.058
C(day_of_week)[T.Friday]        0.2339      0.022     10.733      0.000       0.191       0.277
C(day_of_week)[T.Saturday]      0.4041      0.022     18.273      0.000       0.361       0.448
C(day_of_week)[T.Sunday]        0.3536      0.022     16.108      0.000       0.310       0.397
rain                           -0.0011      0.001     -1.365      0.173      -0.003       0.001
snow                           -0.0121      0.010     -1.176      0.241      -0.032       0.008
==============================================================================
Omnibus:                       25.337   Durbin-Watson:                   1.237
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               38.735
Skew:                           0.509   Prob(JB):                     3.88e-09
Kurtosis:                       4.306   Cond. No.                         64.2
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Analysis

  • Item A sales are not strongly correlated with weather conditions (rain or snow), indicating that customer behavior is relatively unaffected by these factors.

  • Variations in item A sales are primarily driven by seasonality — both in terms of the day of the week and the actual season — as well as whether the day is a holiday.

Model intepretation

To improve model performance, a log transformation was applied to item A sales. For easier interpretation, we will reverse this transformation by exponentiating the coefficients below.

exp_coef
coeff p-value
features
Intercept 1727.95 0.000
is_long_weekend[T.True] 1.00 0.878
C(is_holiday)[T.True] 1.36 0.000
C(season)[T.Spring] 1.10 0.000
C(season)[T.Summer] 1.17 0.000
C(season)[T.Fall] 0.98 0.187
C(day_of_week)[T.Tuesday] 1.01 0.651
C(day_of_week)[T.Wednesday] 1.03 0.128
C(day_of_week)[T.Thursday] 1.02 0.481
C(day_of_week)[T.Friday] 1.26 0.000
C(day_of_week)[T.Saturday] 1.50 0.000
C(day_of_week)[T.Sunday] 1.42 0.000
rain 1.00 0.173
snow 0.99 0.241


Base case: Average item A sales = $1727.95 on a regular Monday in Winter with no rain and no snow.

is_long_weekend - No significant impact on sales.

is_holiday - All else being equal, on average, sales during on a holiday is roughly 36% higher than the base case.

season - No significant difference in item A sales between Winter and Fall.

  • Compared to Winter, and all else being equal:
    • Spring sales are 10% higher
    • Summer sales are 17% higher

day_of_week - No significant difference in item A sales between the weekdays (Mon - Thurs).

  • Relative to Monday, and all else being equal:
    • Friday sales are 26% higher
    • Saturday sales are 50% higher
    • Sunday sales are 42% higher

rain - No significant impact on sales.

snow - No significant impact on sales.

For example:

Average item A sales on a Friday in the Summer with no rain and no snow = $1727.95 x 1.26 x 1.17 = $2547.34


3.2 Understanding Item B sales

Back to top

We will be modelling overall item B sales with the same approach as above.

formula = (
    """
        log_B ~ 
        + rain 
        + snow 
        + is_long_weekend
        + C(is_holiday)
        + C(season)
        + C(day_of_week) 
    """
    )

B_model = smf.ols(formula, data=B_sales_df).fit()

print(B_model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  log_B   R-squared:                       0.850
Model:                            OLS   Adj. R-squared:                  0.844
Method:                 Least Squares   F-statistic:                     142.2
Date:                Fri, 15 Aug 2025   Prob (F-statistic):          1.87e-125
Time:                        13:57:59   Log-Likelihood:                 261.37
No. Observations:                 339   AIC:                            -494.7
Df Residuals:                     325   BIC:                            -441.2
Df Model:                          13                                         
Covariance Type:            nonrobust                                         
===============================================================================================
                                  coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------
Intercept                       7.3488      0.021    351.536      0.000       7.308       7.390
is_long_weekend[T.True]         0.0995      0.029      3.404      0.001       0.042       0.157
C(is_holiday)[T.True]           0.2682      0.039      6.812      0.000       0.191       0.346
C(season)[T.Spring]             0.1558      0.019      8.380      0.000       0.119       0.192
C(season)[T.Summer]             0.4376      0.017     25.364      0.000       0.404       0.472
C(season)[T.Fall]               0.0478      0.017      2.751      0.006       0.014       0.082
C(day_of_week)[T.Tuesday]      -0.0073      0.024     -0.299      0.765      -0.055       0.040
C(day_of_week)[T.Wednesday]     0.0294      0.024      1.219      0.224      -0.018       0.077
C(day_of_week)[T.Thursday]      0.0431      0.024      1.789      0.075      -0.004       0.090
C(day_of_week)[T.Friday]        0.3129      0.024     12.881      0.000       0.265       0.361
C(day_of_week)[T.Saturday]      0.4398      0.025     17.840      0.000       0.391       0.488
C(day_of_week)[T.Sunday]        0.3846      0.024     15.718      0.000       0.336       0.433
rain                           -0.0063      0.001     -6.772      0.000      -0.008      -0.004
snow                           -0.0348      0.011     -3.044      0.003      -0.057      -0.012
==============================================================================
Omnibus:                        5.425   Durbin-Watson:                   1.387
Prob(Omnibus):                  0.066   Jarque-Bera (JB):                6.803
Skew:                          -0.122   Prob(JB):                       0.0333
Kurtosis:                       3.650   Cond. No.                         64.2
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Analysis

  • Item B sales is particularly weak during the Winter.

  • Item B sales are more sensitive to weather conditions, weekly seasonality and long weekends.

Model interpretation

To improve model performance, a log transformation was applied to item B sales. For easier interpretation, we will reverse this transformation by exponentiating the coefficients below.

exp_coef
coeff p-value
features
Intercept 1554.34 0.000
is_long_weekend[T.True] 1.10 0.001
C(is_holiday)[T.True] 1.31 0.000
C(season)[T.Spring] 1.17 0.000
C(season)[T.Summer] 1.55 0.000
C(season)[T.Fall] 1.05 0.006
C(day_of_week)[T.Tuesday] 0.99 0.765
C(day_of_week)[T.Wednesday] 1.03 0.224
C(day_of_week)[T.Thursday] 1.04 0.075
C(day_of_week)[T.Friday] 1.37 0.000
C(day_of_week)[T.Saturday] 1.55 0.000
C(day_of_week)[T.Sunday] 1.47 0.000
rain 0.99 0.000
snow 0.97 0.003


Base case: Average item B sales = $1554.34 on a regular Monday in Winter with no rain and no snow.

is_long_weekend - All else being equal, on average, sales are approximately 10% higher on long weekends compared to regular weekends.

is_holiday - All else being equal, on average, sales during on a holiday is roughly 31% higher than the base case.

season - Compared to Winter, and all else being equal: - Spring sales are 17% higher - Summer sales are 55% higher - Fall sales are 5% higher

day_of_week - No significant difference in item B sales between the weekdays (Mon - Thurs).

  • Relative to Monday, and all else being equal:
    • Friday sales are 37% higher
    • Saturday sales are 55% higher
    • Sunday sales are 47% higher

rain - Each additional 1mm of rainfall is associated with a 1% decrease in item B sales, all else being equal.

snow - Each additional 1mm of snowfall is associated with a 3% decrease in item B sales, all else being equal.

For example:

Average item B sales on a Friday in the Fall with 3mm of rain and no snow = $1554.34 x 1.37 x 1.05 x 0.99^3 = $2169.51


3.3 Understanding item C sales

Back to top

We will be modelling overall item C sales with the same approach as above.

formula = (
    """
        log_C ~ 
        + rain 
        + snow 
        + is_long_weekend
        + C(is_holiday)
        + C(season)
        + C(day_of_week) 
    """
    )

C_model = smf.ols(formula, data=C_sales_df).fit()

print(C_model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  log_C   R-squared:                       0.596
Model:                            OLS   Adj. R-squared:                  0.580
Method:                 Least Squares   F-statistic:                     36.93
Date:                Fri, 15 Aug 2025   Prob (F-statistic):           3.16e-56
Time:                        13:57:59   Log-Likelihood:                 467.48
No. Observations:                 339   AIC:                            -907.0
Df Residuals:                     325   BIC:                            -853.4
Df Model:                          13                                         
Covariance Type:            nonrobust                                         
===============================================================================================
                                  coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------
Intercept                       7.0359      0.011    618.186      0.000       7.014       7.058
is_long_weekend[T.True]         0.0142      0.016      0.889      0.375      -0.017       0.045
C(is_holiday)[T.True]           0.0968      0.021      4.518      0.000       0.055       0.139
C(season)[T.Spring]             0.0749      0.010      7.402      0.000       0.055       0.095
C(season)[T.Summer]             0.0904      0.009      9.623      0.000       0.072       0.109
C(season)[T.Fall]               0.0087      0.009      0.920      0.358      -0.010       0.027
C(day_of_week)[T.Tuesday]      -0.0076      0.013     -0.572      0.568      -0.034       0.018
C(day_of_week)[T.Wednesday]     0.0129      0.013      0.981      0.327      -0.013       0.039
C(day_of_week)[T.Thursday]      0.0128      0.013      0.974      0.331      -0.013       0.039
C(day_of_week)[T.Friday]        0.0849      0.013      6.416      0.000       0.059       0.111
C(day_of_week)[T.Saturday]      0.1485      0.013     11.061      0.000       0.122       0.175
C(day_of_week)[T.Sunday]        0.1243      0.013      9.332      0.000       0.098       0.151
rain                           -0.0007      0.001     -1.286      0.199      -0.002       0.000
snow                           -0.0115      0.006     -1.842      0.066      -0.024       0.001
==============================================================================
Omnibus:                       13.287   Durbin-Watson:                   1.290
Prob(Omnibus):                  0.001   Jarque-Bera (JB):               16.301
Skew:                           0.360   Prob(JB):                     0.000289
Kurtosis:                       3.797   Cond. No.                         64.2
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Model interpretation

To improve model performance, a log transformation was applied to item C sales. For easier interpretation, we will reverse this transformation by exponentiating the coefficients below.

exp_coef
coeff p-values
features
Intercept 1136.75 0.000
is_long_weekend[T.True] 1.01 0.375
C(is_holiday)[T.True] 1.10 0.000
C(season)[T.Spring] 1.08 0.000
C(season)[T.Summer] 1.09 0.000
C(season)[T.Fall] 1.01 0.358
C(day_of_week)[T.Tuesday] 0.99 0.568
C(day_of_week)[T.Wednesday] 1.01 0.327
C(day_of_week)[T.Thursday] 1.01 0.331
C(day_of_week)[T.Friday] 1.09 0.000
C(day_of_week)[T.Saturday] 1.16 0.000
C(day_of_week)[T.Sunday] 1.13 0.000
rain 1.00 0.199
snow 0.99 0.066


Base case: Average item C sales = $1136.75 on a regular Monday in Winter with no rain and no snow.

is_long_weekend - No significant impact on sales.

is_holiday - All else being equal, on average, sales during on a holiday is roughly 10% higher than the base case.

season - No significant difference in item C sales between Winter and Fall.

  • Compared to Winter, and all else being equal:
    • Spring sales are 8% higher
    • Summer sales are 9% higher

day_of_week - No significant difference in item C sales between the weekdays (Mon - Thurs).

  • Relative to Monday, and all else being equal:
    • Friday sales are 9% higher
    • Saturday sales are 16% higher
    • Sunday sales are 13% higher

rain - No significant impact on sales.

snow - No significant impact on sales.

For example:

Average item C sales on a Sunday in the Spring with no snow = $1136.75 x 1.13 x 1.08 = $1387.29

Note: It’s worth noting however that the adjusted R-squared of the regression model is lower compared to the item A and item B models. This suggests certain important variables that explain item C sales are missing. Further investigation is warranted.


Appendix 1 - Statistical tests on recommendations from section 1.2

Back to top

Is the average number of orders higher during the Spring/Summer compared to the Fall/Winter?


\[\begin{aligned} \text{Null hypothesis} &: \mu_{\text{ Spring/Summer}} = \mu_{\text{ Fall/Winter}} \quad \text{(No difference in average orders)} \\ \text{Alternative hypothesis} &: \mu_{\text{ Spring/Summer}} > \mu_{\text{ Fall/Winter}} \quad \text{(Spring/Summer has higher average orders)} \end{aligned}\]


order_warm = combined_df[(combined_df['season']=='Spring') | (combined_df['season']=='Summer')]['in_store_orders']
order_cold = combined_df[(combined_df['season']=='Fall') | (combined_df['season']=='Winter')]['in_store_orders']
# test for equal variance, a ratio under 2 (or above 0.5) is considered acceptable
(order_warm.describe()['std'] / order_cold.describe()['std']).round(2)
1.21
# perform two-sample t-test with equal variance
statistic, p_value = ttest_ind(order_warm, order_cold, alternative='greater') 

print(f'Test statistic: {statistic:.3f}')
print(f'p-value: {p_value:.3f}')
Test statistic: 8.917
p-value: 0.000

There is statistical evidence suggesting the average number of orders is higher during the Spring/Summer compared to the Fall/Winter.


Is the average sales per order higher during holiday’s compared to Fridays and Weekends (non-Summer seasons)?


\[\begin{aligned} \text{Null hypothesis} &: \mu_{\text{ holidays}} = \mu_{\text{ weekends}} \quad \text{(No difference in average sales per order)} \\ \text{Alternative hypothesis} &: \mu_{\text{ holidays}} > \mu_{\text{ weekends}} \quad \text{(Holidays have higher average sales per order)} \end{aligned}\]


Note: We will focus our test on comparing holidays and weekends. Since weekends already exhibit higher average sales per order than Fridays, a significant difference between holidays and weekends would also imply holidays differ sigificantly from Fridays.

spo_holiday = combined_df[(combined_df['type_of_day']=='Holiday') & (combined_df['season']!='Summer')]['sales_per_order']
spo_weekend = combined_df[(combined_df['type_of_day']=='Weekend') & (combined_df['season']!='Summer')]['sales_per_order']
# test for equal variance, a ratio under 2 is considered acceptable
(spo_holiday.describe()['std'] / spo_weekend.describe()['std']).round(2)
1.31
# since our sample for holiday's is small, let's perform a Shapiro-Wilk test to check for normality
statistic, p_value = shapiro(spo_holiday)

print(f'Test statistic: {statistic:.3f}')
print(f'p-value: {p_value:.3f}')
if p_value < 0.05:
    print('The data is not normally distributed.')
else:
    print('The data is normally distributed.')
Test statistic: 0.896
p-value: 0.164
The data is normally distributed.
# perform two-sample t-test 
statistic, p_value = ttest_ind(spo_holiday, spo_weekend, alternative='greater') 

print(f'Test statistic: {statistic:.3f}')
print(f'p-value: {p_value:.3f}')
Test statistic: 1.797
p-value: 0.038

There is statistical evidence suggesting the average sales per order is higher on holidays than weekends.


Appendix 2 - Regression models - multicollinearity checks

Back to top

Multicollinearity checks assess whether two or more independent variables in a regression model are highly correlated with each other. When independent variables are strongly correlated, it becomes difficult for the model to determine their individual effects and may lead to unreliable results.

Here we will be checking with Variance Inflation Factor (VIF). As a general rule of thumb, a VIF under 5 indicates there is no concern of multicollinearity.

HCF item A / item B models

vif_df.round(3)
Features VIF
0 rain 1.685
1 snow 1.172
2 is_long_weekend_True 1.185
3 is_holiday_True 1.375
4 day_of_week_Tuesday 1.169
5 day_of_week_Wednesday 1.242
6 day_of_week_Thursday 1.179
7 day_of_week_Friday 1.258
8 day_of_week_Saturday 1.259
9 day_of_week_Sunday 1.187
10 is_HCF_True 1.657


No signs of multicollinearity among the existing variables.

Overall Item A / item B / item C sales models

vif_df.round(3)
Features VIF
0 hours_opened 17.308
1 avg_temperature 17.791
2 rain 1.357
3 snow 1.128
4 HCF_sales 1.878
5 is_long_weekend_True 1.644
6 is_holiday_True 1.543
7 season_Spring 2.403
8 season_Summer 7.138
9 season_Fall 3.488
10 day_of_week_Tuesday 2.138
11 day_of_week_Wednesday 2.160
12 day_of_week_Thursday 2.150
13 day_of_week_Friday 2.368
14 day_of_week_Saturday 2.468
15 day_of_week_Sunday 2.358


The results indicate there is multicollinearity between hours_opened, avg_temperature and season_Summer. Since hours_opened is largely explained by the day of the week and avg_temperature is correlated with seasonal effects, we will drop these two variables to reduce redundancy and improve model reliability.


vif_df.round(3)
Features VIF
0 rain 1.327
1 snow 1.116
2 HCF_sales 1.364
3 is_long_weekend_True 1.634
4 is_holiday_True 1.467
5 season_Spring 1.647
6 season_Summer 1.854
7 season_Fall 1.912
8 day_of_week_Tuesday 1.440
9 day_of_week_Wednesday 1.459
10 day_of_week_Thursday 1.432
11 day_of_week_Friday 1.461
12 day_of_week_Saturday 1.612
13 day_of_week_Sunday 1.538


There are no longer signs of multicollinearity after dropping hours_opened and avg_temperature.

Appendix 3: Regression models - residual plots

Back to top

We examine the residuals of our regression models to check for any discernible patterns. A random scatter of residuals, with no clear pattern, indicates that the model adequately captures the underlying relationships in the data and that the assumption of linearity between the target variable and independent variables is likely satisfied.

HCF - item A model

hcf_A_scatter.show()

There is no discernable pattern in the residuals.

HCF - item B model

hcf_B_scatter.show()

There is no discernable pattern in the residuals.

Item A model

A_scatter.show()

There is no discernable pattern in the residuals.

Item B model

B_scatter.show()

A couple bigger residuals for smaller fitted values, but overall no clear pattern.

Item C model

C_scatter.show()

There are signs of heteroscedasticity, as the variance of the residuals increases with larger predicted values. This suggests potential model misspecification, possibly due to missing variables in the model. Further investigation is needed to identify and address the cause.